Given observed samples x from a distribution of interest, the goal of a generative model is to learn to model its true data distribution p(x). This learned model can be used to generate new samples or evaluate the likelihood of observed/sampled data.
We usually think of the data we observe as represented by or generated from an associated unseen latent variable, which we denote by the random variable z.
2. Understanding the Evidence Lower Bound (ELBO)
We can imagine the latent variables and the data we observe as modeled by a joint distribution p(x, z). Likelihood-based modeling learns a model that maximizes the likelihood p(x) of all observed x. There are two ways to recover p(x):
Explicitly marginalize out the latent variable z: p(x) = ∫ p(x, z) dz.
Difficult because integrating out all latent variables z is intractable for complex models.
Apply the chain rule of probability: p(x) = p(x, z) / p(z∣x).
Difficult because we have no access to a ground truth latent encoder p(z∣x).
We call the log likelihood log p(x) the “evidence”. From it we can derive a term called the Evidence Lower Bound (ELBO), which, as the name suggests, is a lower bound on the evidence. Maximizing the ELBO then becomes a proxy objective with which to optimize a latent variable model.
Formally, the equation of the ELBO is:
$$\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right]$$
Its relationship with the evidence (log likelihood) is written as:
$$\log p(x) \;\geq\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right]$$
Here qϕ(z∣x) is a flexible approximate variational distribution with parameters ϕ that we seek to optimize; it is meant to approximate the true posterior p(z∣x).
How do we derive the ELBO? (And why is the ELBO an objective we would like to maximize?)
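One common route introduces qϕ(z∣x) into the marginalization and applies Jensen's inequality (the standard steps, sketched here):
$$
\begin{aligned}
\log p(x) &= \log \int p(x, z)\, dz
= \log \int \frac{p(x, z)\, q_\phi(z\mid x)}{q_\phi(z\mid x)}\, dz \\
&= \log \mathbb{E}_{q_\phi(z\mid x)}\!\left[\frac{p(x, z)}{q_\phi(z\mid x)}\right]
\;\geq\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right]
\end{aligned}
$$
This confirms that the ELBO is a lower bound on the evidence, but it does not reveal how tight the bound is; the next derivation makes that explicit.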
To better understand the relationship between the evidence and the ELBO, here is another derivation:
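Starting from the evidence itself and using the chain rule p(x, z) = p(z∣x) p(x), the standard steps are:
$$
\begin{aligned}
\log p(x) &= \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p(x)\right]
= \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p(x, z)}{p(z\mid x)}\right]
= \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p(x, z)\, q_\phi(z\mid x)}{q_\phi(z\mid x)\, p(z\mid x)}\right] \\
&= \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right]}_{\text{ELBO}}
+ \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{q_\phi(z\mid x)}{p(z\mid x)}\right]}_{D_{\mathrm{KL}}\left(q_\phi(z\mid x)\,\|\,p(z\mid x)\right)}
\end{aligned}
$$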
From this decomposition we observe that the evidence equals the ELBO plus the KL divergence between the approximate posterior qϕ(z∣x) and the true posterior p(z∣x). Note that the left-hand side is the evidence, which is a constant with respect to ϕ. Since the ELBO and the KL divergence always sum to this constant, any maximization of the ELBO with respect to ϕ necessarily implies an equal minimization of the KL divergence term. Thus, the ELBO can be maximized as a proxy for learning how to perfectly model the true latent posterior distribution.
3. Variational AutoEncoder (VAE)
In the default VAE, we directly maximize the ELBO.
It is ‘variational’ because we optimize for the best qϕ(z∣x) among a family of potential posterior distributions parameterized by ϕ.
It is an ‘autoencoder’ because the input data is trained to predict itself after undergoing an intermediate bottlenecking representation step.
Now, we dissect the ELBO term further:
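Using the chain rule p(x, z) = p(z) pθ(x∣z) inside the expectation, the ELBO splits into two terms (the standard decomposition, restated here):
$$
\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right]
= \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]
- D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right)
$$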
We learn an intermediate bottlenecking distribution qϕ(z∣x) that can be treated as an encoder. We also learn a deterministic function pθ(x∣z) to convert a given latent vector z into an observation x, which can be interpreted as a decoder.
The first term measures the reconstruction likelihood of the decoder from our variational distribution. The second term measures how similar the learned variational distribution is to a prior belief held over latent variables. Maximizing the ELBO is thus equivalent to maximizing the first (reconstruction) term while minimizing the second (KL divergence) term.
The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is often selected to be a standard multivariate Gaussian:
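That is, the encoder outputs a mean μϕ(x) and a diagonal covariance σϕ²(x)I:
$$
q_\phi(z\mid x) = \mathcal{N}\!\left(z;\, \mu_\phi(x),\, \sigma_\phi^2(x)\, I\right),
\qquad
p(z) = \mathcal{N}(z;\, 0,\, I)
$$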
Then the KL divergence term of the ELBO can be computed analytically, and the reconstruction term can be approximated using a Monte Carlo estimate, giving a new form of the objective.
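A standard way to write this objective, approximating the reconstruction term with L Monte Carlo samples, is:
$$
\arg\max_{\phi,\,\theta}\;
\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]
- D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right)
\;\approx\;
\arg\max_{\phi,\,\theta}\;
\sum_{l=1}^{L} \log p_\theta\!\left(x\mid z^{(l)}\right)
- D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right)
$$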
where the latents z^{(l)}, l = 1, …, L, are sampled from qϕ(z∣x) for every observation x in the dataset. A problem with this approach is that each z^{(l)} on which the loss is computed is generated by sampling from the multivariate Gaussian N(z; μϕ(x), σϕ²(x)I), and this sampling step is non-differentiable with respect to ϕ. This can be addressed by the reparameterization trick.
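The trick rewrites a sample of z as a deterministic function of the encoder outputs and an auxiliary noise variable ε:
$$
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,
\qquad
\epsilon \sim \mathcal{N}(\epsilon;\, 0,\, I)
$$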
The reparameterization trick disentangles the parameters ϕ, on which we need to perform gradient descent, from the non-differentiable sampling process. With this formulation we only need to sample ε from a standard Gaussian, which doesn't involve any trainable parameters, while the mapping from ε to z is deterministic and differentiable in ϕ.
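To make this concrete, here is a minimal PyTorch-style sketch, assuming a diagonal-Gaussian encoder, a standard-Gaussian prior, and (for concreteness) a Bernoulli decoder; the function names are illustrative rather than taken from the text above:
```python
import torch
import torch.nn.functional as F

def reparameterized_sample(mu, log_var):
    # eps is drawn from a standard Gaussian, which has no trainable parameters;
    # the returned z is a deterministic, differentiable function of mu and log_var.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def negative_elbo(x, x_recon_logits, mu, log_var):
    # Reconstruction term: negative log-likelihood of a Bernoulli decoder.
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    # KL( N(mu, sigma^2 I) || N(0, I) ) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
Minimizing this quantity corresponds to maximizing a single-sample (L = 1) Monte Carlo estimate of the ELBO, with gradients flowing through mu and log_var into the encoder parameters ϕ while the only randomness comes from eps.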
4. Hierarchical Variational Autoencoders (HVAE)
The general HVAE has T hierarchical levels, and each latent is allowed to condition on all previous latents. Here we focus on a special case, the Markovian HVAE (MHVAE). In an MHVAE the generative process is a Markov chain: each latent z_t is generated only from the previous latent z_{t+1}. Mathematically, we write z_{1:T} for the full set of latents and represent the joint distribution and the posterior of a Markovian HVAE as follows.
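Under the Markov assumption, each factor conditions only on the neighboring variable:
$$
p(x, z_{1:T}) = p(z_T)\, p_\theta(x\mid z_1) \prod_{t=2}^{T} p_\theta(z_{t-1}\mid z_t),
\qquad
q_\phi(z_{1:T}\mid x) = q_\phi(z_1\mid x) \prod_{t=2}^{T} q_\phi(z_t\mid z_{t-1})
$$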
Then the ELBO can be extended to this hierarchical setting.
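Replacing the single latent z with the full collection of latents z_{1:T} in the bound derived earlier gives:
$$
\log p(x) \;\geq\; \mathbb{E}_{q_\phi(z_{1:T}\mid x)}\!\left[\log \frac{p(x, z_{1:T})}{q_\phi(z_{1:T}\mid x)}\right]
$$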
We can now plug this joint distribution and posterior into the extended ELBO.
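Substituting the two factorizations into the bound yields:
$$
\log p(x) \;\geq\; \mathbb{E}_{q_\phi(z_{1:T}\mid x)}\!\left[\log \frac{p(z_T)\, p_\theta(x\mid z_1) \prod_{t=2}^{T} p_\theta(z_{t-1}\mid z_t)}{q_\phi(z_1\mid x) \prod_{t=2}^{T} q_\phi(z_t\mid z_{t-1})}\right]
$$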
This equation can be further decomposed into interpretable components when we investigate Variational Diffusion Models.